The Linear Model is one of the simplest models; we use this section as a starting point to introduce some common ideas
in Data Science.
Least Squares Problem
Suppose we have $n$ data points, $x_1,\cdots,x_n$, in $\mathbb{R}^p $, and denote by $$\mathbf{X} \in
\mathbb{R}^{n\times p }$$
the matrix with entries given by $$\mathbf{X}_{i,j}=(x_i)_j$$
For each point $x_i$ there is a corresponding output $y_i$; we will denote the output vector by
$$\mathbf{y} := \begin{pmatrix} y_1\\\vdots\\y_n \end{pmatrix}$$
Denote the design matrix
$$X:=[\mathbf{X}|\mathbb{1}] \in \mathbb{R}^{n\times (p+1) }$$
The goal of the least squares problem is to find the vector $w\in \mathbb{R}^{p+1}$
that minimizes the Sum of Squared Errors
$$\mathrm{SSE}(w):=\|X w - \mathbf{y}\|_2^2 $$
Geometric Approach
Projection Matrix
We start with the geometric approach, the one I find most natural. Recall that we have the following
minimization problem,
$$\min_{w\in \mathbb{R}^{p+1}} \|X w - \mathbf{y}\|_2^2$$
The vector $Xw$ lives in the column space of $X$; thus, to minimize the above squared distance,
we only need to require
$Xw= \widehat{\mathbf{y}}$, where
$ \widehat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the column space of $X$. Recall that
this projection is given by the projection matrix $X( X^T X)^{-1}X^T$ (assuming $X^TX$ is invertible), so we need to have
$$Xw= \widehat{\mathbf{y}} =X( X^T X)^{-1}X^T \mathbf{y},$$
which implies
$$ w = ( X^T X)^{-1}X^T \mathbf{y}$$
Since $$\hat{\mathbf{y}} = Xw =X( X^T X)^{-1}X^T \mathbf{y}, $$
the projection matrix below is also often called the hat matrix (it puts the hat on $\mathbf{y}$):
$$P = X( X^T X)^{-1}X^T$$
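As a quick sanity check, here is a minimal numpy sketch (numpy is my choice here, not something the page uses) that builds a random design matrix, forms the hat matrix, and confirms that the projection solution agrees with numpy's built-in least squares solver.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X_raw = rng.normal(size=(n, p))          # data matrix X in R^{n x p}
X = np.hstack([X_raw, np.ones((n, 1))])  # design matrix [X | 1]
y = rng.normal(size=n)

w = np.linalg.solve(X.T @ X, X.T @ y)    # w = (X^T X)^{-1} X^T y
P = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
y_hat = P @ y                            # projection of y onto col(X)

assert np.allclose(y_hat, X @ w)                               # X w is exactly the projection
assert np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0])    # matches the built-in solver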
Correlations, R Squared and F-statistics
Correlations
Denote the $i$-th column of $X$ by $X_i$. The standard definition of the sample correlation can be
rewritten as
$$
\begin{aligned}
\mathtt{Corr}\left(\mathbf{y}, X_{i}\right):&=\frac{\sum_j(\mathbf{y}_j-\overline{\mathbf{y}})
(X_{ji}-\overline{X_{i}})}{\sqrt{\sum_j(
X_{ji}-\overline{X_{i}})^2\sum_j(\mathbf{y}_j-\overline{\mathbf{y}})^2}}
\\ &=\frac{\left(\mathbf{y}-\overline{\mathbf{y}}\right)\cdot\left(
X_{i}-\overline{X_{i}}\right)}{\|X_i-\overline{X_{i}}\|\,\|\mathbf{y}-\overline{\mathbf{y}}\|}\\ &=\cos
\left(\angle\left(\mathbf{y}-\overline{\mathbf{y}},
X_{i}-\overline{X_{i}}\right)\right)
\end{aligned}
$$
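A small numpy sketch (the variable names are mine) verifying that the sample correlation is exactly the cosine of the angle between the centered vectors:
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)
x_i = 2 * y + rng.normal(size=100)       # one column of X, correlated with y

yc, xc = y - y.mean(), x_i - x_i.mean()  # centered vectors
cos_angle = (yc @ xc) / (np.linalg.norm(yc) * np.linalg.norm(xc))

assert np.isclose(cos_angle, np.corrcoef(y, x_i)[0, 1])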
$R^2$
Similarly, writing $SST:=\|\mathbf{y}-\overline{\mathbf{y}}\|^2$, $SSR:=\|\widehat{\mathbf{y}}-\overline{\mathbf{y}}\|^2$ and $SSE:=\|\mathbf{y}-\widehat{\mathbf{y}}\|^2$, and using the decomposition
$$SST=SSE+SSR,$$
we have
$$R^{2}:=\frac{SSR}{SST}=\cos ^{2}(\angle(\mathbf{y}-\overline{\mathbf{y}},
\widehat{\mathbf{y}}-\overline{\mathbf{y}}))$$
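A matching numpy sketch: fit a one-variable least squares line and check that $R^2$ computed as a squared cosine agrees with the usual $1-SSE/SST$ formula (synthetic data, my variable names).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(size=100)

X = np.column_stack([x, np.ones_like(x)])     # design matrix [x | 1]
w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w

yc, hc = y - y.mean(), y_hat - y.mean()       # with an intercept, mean(y_hat) == mean(y)
r2_cos = (yc @ hc) ** 2 / ((yc @ yc) * (hc @ hc))          # cos^2 of the angle
r2_std = 1 - np.sum((y - y_hat) ** 2) / np.sum(yc ** 2)    # 1 - SSE/SST
assert np.isclose(r2_cos, r2_std)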
Alternatively, we can minimize $\mathrm{SSE}(b)$ directly with calculus. The loss function can be expanded as
$$\mathrm{SSE}(b)=b^TX^TXb-2(Xb)^T\mathbf{y}+\mathbf{y}^T\mathbf{y} $$
Taking derivative (gradient) w.r.t $b$ gives
$$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}$$
At a local minimum, we have
$$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}=0$$
which gives us the Normal Equation
$$X^TX b= X^T \mathbf{y}$$
When the inverse of $X^TX$ exists, we have, again, $$ b = ( X^T X)^{-1}X^T \mathbf{y}$$
For the ideas behind the Gradient Descent algorithm, see the page Optimization
Methods. In the example, we applied the Gradient Descent algorithm to minimize $\mathrm{SSE}(b)$.
The updating rule is given by
$$b_{j+1}=b_{j}-\alpha X^{T}(Xb_j-\mathbf{y}),$$
where the constant factor $2$ from the gradient has been absorbed into the learning rate $\alpha$.
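A minimal gradient descent sketch for $\mathrm{SSE}(b)$ in numpy (the learning rate and iteration count are arbitrary choices, not taken from the interactive example):
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.hstack([rng.normal(size=(n, p)), np.ones((n, 1))])   # design matrix with intercept column
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

b = np.zeros(p + 1)
alpha = 0.001                              # learning rate
for _ in range(5000):
    b = b - alpha * X.T @ (X @ b - y)      # b_{j+1} = b_j - alpha * X^T (X b_j - y)

b_closed = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(b, b_closed, atol=1e-3)  # agrees with the normal equation solution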
The Hessian of $\mathrm{SSE}$ is given by $$H_\mathrm{SSE}(b)=J_{\nabla \mathrm{SSE}}( b )=2X^TX$$
Thus, assuming $X$ has full column rank, $H_\mathrm{SSE} (b)$ is positive definite, and locally, by the Derivative
Test, any critical point of $\mathrm{SSE}(b)$ is a local minimum.
Globally, $H_\mathrm{SSE} (b)$ being positive definite for all $b$ implies $\mathrm{SSE}(b)$ is convex, and
thus our local minimum is the unique global minimum.
Parametric Approach
Linear Model Assumptions
Regarded as the problem of fitting the least squares line to the data points, Linear Regression is just a
mathematical problem. But once one tries to do Statistical Inference, it becomes a
statistical problem,
and assumptions are needed for meaningful interpretations.
A Linear Model in Regression is a model of the form $$\mathbf{y}=X b+\mathcal{E}, $$
where the error term $\mathcal{E}$ is a random vector with distribution $\mathcal{E}\sim \mathcal{N}(0,\sigma^2 I)$.
Maximum Likelihood
Let's take a small quiz:
let $\mathbf{x}$ be a Gaussian random variable with mean $1$, and suppose we observe 4 realizations of
$\mathbf{x}$.
Which of the candidate distributions is more likely to have generated these realizations?
The idea of maximum likelihood essentially formalizes the above example. Let $V$
be a random vector distributed according to a density depending on a parameter $\theta$,
$$ p(V|\theta)$$
Given a realization $v$ of $V$, the likelihood function is defined as the following function of
$\theta$,
$$L(\theta): = p(v|\theta)$$
And the maximum likelihood estimator for $\theta$ is given by
$$ \mathrm{argmax}_\theta\ L(\theta) $$
Since the $\ln$ function is monotonically increasing, and turns products into sums, it is often easier to work
with
the log-likelihood function
$$l(\theta)=\ln L(\theta)$$
And the corresponding optimization problem
$$ \mathrm{argmax}_\theta\ l(\theta) $$
In the linear regression model, it turns out that the MLE for $b$ does not depend on $\sigma$; thus, for
simplicity,
we will assume
$$\mathbf{y}=X b+\mathcal{E}\sim \mathcal{N}(Xb,I) $$
Given $\mathbf{y}$, $X$ fixed, the Likelihood function for $b$ is given by
$$L(b)=(2 \pi)^{-\frac{n}{2}}
e^{-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb)}, $$
and the log-likelihood is given by $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb)
$$
Note that from here we see that the maximization problem for MLE is equivalent to the minimization problem for
$\mathrm{SSE}(b)$.
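To see the equivalence numerically, here is a hedged sketch that maximizes the log-likelihood by minimizing its negative with scipy.optimize.minimize and compares the result with the closed-form least squares solution (scipy is my choice here, not something the page relies on):
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, p = 80, 2
X = np.hstack([rng.normal(size=(n, p)), np.ones((n, 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

def neg_log_likelihood(b):
    r = y - X @ b
    return 0.5 * r @ r                      # -l(b) up to an additive constant

b_mle = minimize(neg_log_likelihood, np.zeros(p + 1)).x
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(b_mle, b_ols, atol=1e-4)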
Both the Derivative Test and Gradient Descent/Ascent can be used to find or approximate the maximum
likelihood estimator.
But let us use this opportunity to apply Newton's method.
Newton's Method
The idea behind Newton's Method is discussed on the page Optimization
Methods. We go straight to its application to maximizing the log-likelihood function.
To maximize $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb) $$
We apply Newton's method to
$$f(b):=\nabla l(b) = -X^TX b + X^T \mathbf{y}$$
We have, the Jacobian of $f$ given by $$J_f(b) = H_l(b)=-X^TX$$
Note that $J_f(b)$ is negative definite (again assuming $X$ has full column rank), so locally, by the Derivative
Test, any critical point of $l(b)$ is a local maximum.
Globally, $J_f(b)$ being negative definite for all $b$ implies $l(b)$ is concave, and thus our local
maximum
is the unique global maximum.
Given any initial guess $b $, one step of Newton's Method gives
$$\begin{align}
\beta&=b-J_f(b)^{-1}f(b)\\
& =b+(X^TX)^{-1}(-X^TXb+X^Ty)\\
&=(X^TX)^{-1}X^Ty
\end{align}$$
i.e. only one step is needed.
The application of Newton's method to Linear Regression requires only one step because $f(b)$ is an affine map in
$b$,
i.e. it is of the form $f(b) = Mb+v$.
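A short numerical check of the one-step claim in numpy (the starting point $b_0$ is arbitrary):
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 3
X = np.hstack([rng.normal(size=(n, p)), np.ones((n, 1))])
y = rng.normal(size=n)

b0 = rng.normal(size=p + 1)                      # any initial guess
f = -X.T @ X @ b0 + X.T @ y                      # f(b0) = grad l(b0)
J = -X.T @ X                                     # Jacobian of f (constant in b)
b1 = b0 - np.linalg.solve(J, f)                  # one Newton step

assert np.allclose(b1, np.linalg.solve(X.T @ X, X.T @ y))   # already the least squares solution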
A simple nontrivial application of Newton's Method can be found on the page Optimization
Methods.
Regularizations
Regularization is a technique used to prevent overfitting in linear regression. It adds a penalty term to the loss
function, which encourages the model to have smaller coefficients. This helps to prevent the model from fitting the
noise in the data and instead focuses on the underlying relationship between the features and the target variable.
Ridge Regression & Lasso Regression
Lasso Regression
Uses absolute value penalty for coefficient sparsity:
$$\text{Regularized Loss} = \text{Original Loss} + \lambda \sum_{i=1}^{p} |\beta_i|$$
Key properties:
Feature selection: Drives irrelevant coefficients to exactly zero.
Scale sensitivity: Requires feature standardization for proper shrinkage.
Sparse solutions: Ideal for "large p, small n" problems.
Best for:
Interpretable models with clear feature importance.
Ridge Regression
Adds a penalty proportional to the square of coefficients to the loss function:
$$\text{Regularized Loss} = \text{Original Loss} + \lambda \sum_{i=1}^{p} \beta_i^2$$
Key properties:
Smooth shrinkage: Reduces coefficient magnitudes without eliminating features.
Multicollinearity handling: Distributes weights among correlated variables.
Requires tuning the regularization strength $\lambda$.
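A minimal scikit-learn sketch of both penalties (scikit-learn is an assumption on my part; its alpha parameter plays the role of $\lambda$ above, and the features are standardized first because Lasso is scale sensitive):
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)    # only 2 relevant features

X_std = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X_std, y)
lasso = Lasso(alpha=0.1).fit(X_std, y)

print("ridge:", np.round(ridge.coef_, 2))   # all coefficients shrunk, none exactly zero
print("lasso:", np.round(lasso.coef_, 2))   # most irrelevant coefficients driven to exactly zero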
Logistic Regression
log-odds
A Linear Model assumes $$y\sim
\mathcal{N}(\mathbf{b}^T\mathbf{x},\sigma^2)$$ i.e. at any given point $\mathbf{x}$, it assumes
$y= f(\mathbf{x})$ follows a Gaussian distribution; this prevents us from modeling the case where
$y=f(\mathbf{x})$
follows some other distribution.
Consider, for example, a binary Classification problem with labels $0$ and $1$. At any point
$\mathbf{x}$, it is reasonable to assume that
$$f(\mathbf{x})\sim \mathtt{Bernoulli}(p(\mathbf{x}))$$
And this leads us naturally to the Logistic Regression.
(For comparison, when using linear regression for classification, we round the prediction to the nearest
label.)
In Logistic Regression, we assume for some parameter $\theta$,
$$y\sim
\mathtt{Bernoulli}(p_\theta{(\mathbf{x})})$$
Recall that the odds of $y=1$ conditioned on $\mathbf{x}$ is given by
$$\frac{\mathtt{Pr}(y=1\mid \mathbf{x})}{\mathtt{Pr}(y=0\mid \mathbf{x})}=
\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\geq 0$$
Applying the monotonic function $\log$ to the above, we get the log-odds of
$y=1$, or equivalently the logit function applied
to $p_\theta{(\mathbf{x})}$, i.e.
$$\mathtt{logit}(p_\theta{(\mathbf{x})}):=\log\left(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\right)\in
\mathbb{R}$$
and a naive additional assumption is that the log-odds is a linear function of
$\mathbf{x}$, i.e.
$$\log\left(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\right)=\theta\cdot \mathbf{x}$$
thus by the above assumptions, we have
$$p_\theta{(\mathbf{x})}= \frac{1}{1+\exp(-\theta\cdot
\mathbf{x})}=:\mathtt{sigmoid}(\theta\cdot \mathbf{x})$$
Where sigmoid is the inverse of the logit function.
The goal of logistic regression is to estimate the probabilities $p_\theta(\mathbf{x})$.
Since the sigmoid function is not linear, it is more convenient to use MLE to estimate
$\theta$.
Maximizing the likelihood function is equivalent to minimizing the negative log-likelihood
$$L(\theta)=-\sum_{i=1}^{n}\left[ y^{(i)} \log \left(p_\theta(\mathbf{x}^{(i)})\right)+\left(1-y^{(i)}\right) \log
\left(1-p_\theta(\mathbf{x}^{(i)})\right)\right]$$
The summands inside the brackets are $$\begin{cases}\log \left(p_\theta(\mathbf{x}^{(i)})\right),\qquad &
y^{(i)}=1\\
\log \left(1-p_\theta(\mathbf{x}^{(i)})\right), & y^{(i)}=0 \end{cases}
$$
The above loss function is also known as the Binary Cross Entropy function.
Using Gradient Descent, we get the following updating rule $$\theta_{j+1}=\theta_{j}-\alpha
X^{T}(p_{\theta_j}(X)-\mathbf{y}),$$
where $X$ is the design matrix and $p_{\theta_j}(X)$ denotes the vector of predicted probabilities $\left(p_{\theta_j}(\mathbf{x}^{(1)}),\ldots,p_{\theta_j}(\mathbf{x}^{(n)})\right)^T$.
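A minimal numpy sketch of this update rule on synthetic data (the true parameter, learning rate, and step count are all invented for illustration):
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])   # design matrix with intercept
theta_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta = np.zeros(3)
alpha = 0.01                                                 # learning rate
for _ in range(5000):
    theta = theta - alpha * X.T @ (sigmoid(X @ theta) - y)   # gradient of the BCE loss

print("estimated theta:", np.round(theta, 2))                # roughly recovers theta_true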
Model Coefficients
The logistic regression model can be expressed as:
$$
\log\left(\frac{P(y=1)}{1-P(y=1)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p
$$
Taking the exponential of both sides, we get:
$$
\begin{align}
\text{odds}(y=1) &= \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)\\
&= \exp(\beta_0)\exp(\beta_1 x_1)\exp(\beta_2 x_2)\cdots\exp(\beta_p x_p)
\end{align}
$$
The exponential of a coefficient, $$\exp(\beta_j),$$
is the multiplicative change in the odds of $y=1$ for a one-unit increase in the corresponding feature $x_j$, holding the other features fixed.
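For instance, with a hypothetical fitted coefficient vector (the numbers below are made up for illustration), the fitted odds ratios are just the elementwise exponentials:
import numpy as np

beta = np.array([0.69, -1.10, 0.00])      # hypothetical fitted coefficients (intercept omitted)
odds_ratios = np.exp(beta)
print(np.round(odds_ratios, 2))           # approximately [1.99, 0.33, 1.0]
# e.g. a one-unit increase in x_1 multiplies the odds of y=1 by roughly 2, all else fixed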
Generalized Linear Models
A Generalized Linear Model is of
the form
$$\eta\left(\mathrm{E}\left(Y \mid \bf x\right)\right)= \theta \cdot \bf x',\qquad \bf x' = \begin{pmatrix}1\\
\bf x \end{pmatrix} $$
for some link function $\eta$, where we assume that
the entries of $ Y$ are independently sampled from an Exponential Family.
Examples of GLM
The Linear Model is a GLM , where the link function is the identity function, and
$Y$ is from
the Gaussian Distribution (an exponential family).
The Logistic Regression Model is a GLM, where the link function is the logit
function
$$\eta(p)=\log \left(\frac{p}{1-p}\right)$$
and $Y$
is from the Bernoulli Distribution (another example of exponential family).
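In statsmodels (used later on this page for mixed models), these two special cases differ only in the family argument; a minimal sketch with invented data:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = sm.add_constant(rng.normal(size=(200, 2)))               # design matrix with intercept
y_cont = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
y_bin = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ np.array([0.5, 2.0, -1.0])))).astype(float)

linear = sm.GLM(y_cont, X, family=sm.families.Gaussian()).fit()    # identity link by default
logistic = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()   # logit link by default
print(linear.params, logistic.params, sep="\n")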
If we let $y$ take values according to the following PMF
$$\mathtt{Pr}(y=k \mid\mathbf{x})= \begin{cases}\frac{\exp(\beta_k \cdot
\mathbf{x})}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \mathbf{x})} \quad & k = 1,2,\cdots,K-1 \\
\frac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \mathbf{x})} & k=K \end{cases} $$
and $$\eta(p) = \log\left(\frac{p}{\mathtt{Pr}(y=K\mid \mathbf{x})}\right), $$ we get Multinomial
Logistic
Regression.
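The PMF above is a softmax over $K$ scores with the $K$-th score fixed at $0$; a small numpy sketch with made-up $\beta_k$ and $\mathbf{x}$:
import numpy as np

betas = np.array([[0.5, -1.0],
                  [1.5, 0.2]])                     # beta_1, ..., beta_{K-1}, each in R^2 (K = 3)
x = np.array([1.0, 2.0])

scores = np.append(betas @ x, 0.0)                 # the score of class K is fixed at 0
probs = np.exp(scores) / np.exp(scores).sum()      # matches the PMF above
print(np.round(probs, 3), probs.sum())             # the probabilities sum to 1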
If we let $Y\sim \mathtt{Poisson}(\lambda)$ and take $\eta = \log$, then we get the Poisson Regression
Model:
$$
\lambda:=\mathrm{E}(Y \mid \mathbf{x})=e^{\theta\cdot \mathbf{x}}
$$
$$
p(y \mid \mathbf{x} ; \theta)=\frac{\lambda^y}{y !} e^{-\lambda}=\frac{e^{y\, \theta\cdot \mathbf{x}}\, e^{-e^{\theta\cdot
\mathbf{x}}}}{y !}
$$
Modeling Default Rates
$$\log(\lambda) = \beta_0 + \beta_1 (\text{unemployment rate}) + \beta_2
(\text{loan-to-income ratio}) +
\log(\text{exposure})$$
where $\lambda$ is the expected default count and $\text{exposure}$ accounts for loan volume.
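A hedged statsmodels sketch of such a model; the column names and data below are invented for illustration, and $\log(\text{exposure})$ enters as an offset:
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 500
df = pd.DataFrame({
    "unemployment_rate": rng.uniform(3, 10, n),
    "loan_to_income": rng.uniform(0.5, 3.0, n),
    "exposure": rng.integers(100, 1000, n),
})
lam = df["exposure"] * np.exp(-4 + 0.15 * df["unemployment_rate"] + 0.4 * df["loan_to_income"])
df["defaults"] = rng.poisson(lam)

# Poisson GLM with log link; log(exposure) enters as an offset
X = sm.add_constant(df[["unemployment_rate", "loan_to_income"]])
model = sm.GLM(df["defaults"], X, family=sm.families.Poisson(),
               offset=np.log(df["exposure"]))
print(model.fit().summary())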
Mixed Effect Models
A Linear Mixed Effects Model is of the form $$\mathbf{y} = X\beta + U\gamma +
\mathbf{\epsilon},$$
where $X$ is the design matrix of the fixed effects with coefficients $\beta$, $U$ is the design matrix of the random effects $\gamma$ (typically assumed to satisfy $\gamma\sim\mathcal{N}(0,G)$ for some covariance matrix $G$), and $\mathbf{\epsilon}$ is the residual error.
If $U$ is the zero matrix, then the model reduces to an ordinary Linear Model (in particular, a GLM).
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Example data: weights of pigs measured over time
data = sm.datasets.get_rdataset("dietox", "geepack").data

# Fit a model with a fixed effect for Time and a random intercept for each Pig
model = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
results = model.fit()

# Predictions using the fixed effects only
fixed_predictions = results.predict(data)

# Add each pig's estimated random intercept to get group-specific predictions
random_effects = results.random_effects  # dict: group label -> Series of random effects
group_predictions = fixed_predictions + data["Pig"].map(lambda g: random_effects[g].iloc[0])
Neural Networks
Logistic Regression and the Linear Model can each be viewed as a single-layer neural network: the former with a sigmoid
activation and cross-entropy loss, the latter with an identity activation and squared-error loss.
We will discuss neural networks in more detail on a later page, Neural Networks.